OCPBUGS-86310: Handle CA bundle aggregation delay by requeuing revocation#8563
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis pull request adjusts the certificate revocation controller's trust-bundle propagation logic to gate operations on synchronization state. The controller now issues synthetic requeues in two scenarios: when the new signer certificate has not yet propagated to the trust bundle, and when the old signer certificate has not yet been revoked from the trust bundle. Tests were updated to expect these requeue outcomes. Sequence Diagram(s)sequenceDiagram
participant CertificateRevocationController
participant TotalClientTrustBundle
CertificateRevocationController->>TotalClientTrustBundle: trustedCertificates(newSigner)
TotalClientTrustBundle-->>CertificateRevocationController: count = 0
CertificateRevocationController->>CertificateRevocationController: synthetic requeue (new signer not propagated)
CertificateRevocationController->>TotalClientTrustBundle: trustedCertificates(oldSigner)
TotalClientTrustBundle-->>CertificateRevocationController: count > 0
CertificateRevocationController->>CertificateRevocationController: synthetic requeue (old signer still trusted)
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go (2)
1035-1037: ⚡ Quick winConsider adding inline comment and debug logging for observability.
Following the pattern at lines 1061 and 1073, consider adding an inline comment explaining why synthetic requeue is needed here (waiting for TargetConfigController to remove the old signer from the trust bundle), plus debug logging to help diagnose similar stuck conditions in the future.
📝 Suggested improvement for observability
if len(trustedCertificates(totalClientTrustBundle, []*certificateSecret{{cert: oldCerts[0]}}, now)) != 0 { + klog.V(4).Infof("Old signer certificate still present in total trust bundle for %s/%s, requeueing", namespace, name) - return true, nil, true, nil + return true, nil, true, nil // synthetic re-queue needed for trust bundle revocation by TargetConfigController }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go` around lines 1035 - 1037, Add an inline comment above the conditional using trustedCertificates(totalClientTrustBundle, []*certificateSecret{{cert: oldCerts[0]}}, now) explaining that we return a synthetic requeue here to wait for the TargetConfigController to remove the old signer from the trust bundle; also add a debug log (using the controller's logger) just before returning that records the computed trust length (len(...)) and identifying info about oldCerts[0] (e.g., subject/serial or a stable identifier) so operators can diagnose stuck conditions when this branch fires.
735-737: ⚡ Quick winConsider adding inline comment and debug logging for observability.
Following the pattern at lines 761 and 780, consider adding an inline comment explaining why synthetic requeue is needed here (waiting for TargetConfigController to propagate the new signer to the trust bundle), plus debug logging similar to other requeue conditions in this controller.
📝 Suggested improvement for observability
if len(trustedCertificates(totalClientTrustBundle, []*certificateSecret{{cert: signers[0]}}, now)) == 0 { + klog.V(4).Infof("New signer certificate %s/%s not yet propagated to total trust bundle for %s/%s, requeueing", signer.Namespace, signer.Name, namespace, name) - return true, nil, true, nil + return true, nil, true, nil // synthetic re-queue needed for trust bundle propagation by TargetConfigController }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go` around lines 735 - 737, Add an inline comment above the if that uses trustedCertificates(totalClientTrustBundle, []*certificateSecret{{cert: signers[0]}}, now) == 0 explaining this is a synthetic requeue to wait for the TargetConfigController to propagate the new signer into the trust bundle; also add a debug log (using the controller's logger, e.g. r.log or processLogger) in the same branch that logs the reason and relevant context (e.g. signer fingerprint/identifier from signers[0], and that trustedCertificates returned 0) so the requeue is observable and consistent with other requeue debug logs in this controller.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`:
- Around line 1035-1037: Add an inline comment above the conditional using
trustedCertificates(totalClientTrustBundle, []*certificateSecret{{cert:
oldCerts[0]}}, now) explaining that we return a synthetic requeue here to wait
for the TargetConfigController to remove the old signer from the trust bundle;
also add a debug log (using the controller's logger) just before returning that
records the computed trust length (len(...)) and identifying info about
oldCerts[0] (e.g., subject/serial or a stable identifier) so operators can
diagnose stuck conditions when this branch fires.
- Around line 735-737: Add an inline comment above the if that uses
trustedCertificates(totalClientTrustBundle, []*certificateSecret{{cert:
signers[0]}}, now) == 0 explaining this is a synthetic requeue to wait for the
TargetConfigController to propagate the new signer into the trust bundle; also
add a debug log (using the controller's logger, e.g. r.log or processLogger) in
the same branch that logs the reason and relevant context (e.g. signer
fingerprint/identifier from signers[0], and that trustedCertificates returned 0)
so the requeue is observable and consistent with other requeue debug logs in
this controller.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: f40fd551-96db-4148-97eb-11c0ed360cc5
📒 Files selected for processing (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8563 +/- ##
=======================================
Coverage 40.37% 40.38%
=======================================
Files 755 755
Lines 93227 93227
=======================================
+ Hits 37644 37648 +4
+ Misses 52879 52875 -4
Partials 2704 2704
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
🧹 Nitpick comments (2)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go (2)
466-466: ⚡ Quick winClarify test name to reflect requeue behavior.
The test name "not yet propagated, nothing to do" is misleading since the controller now requeues (line 470). Consider renaming to follow the "When ... it should ..." pattern and accurately describe the behavior.
📝 Suggested test name
- name: "not yet propagated, nothing to do", + name: "When new signer not yet propagated it should requeue",As per coding guidelines, use "When ... it should ..." format for describing test cases when creating unit tests.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go` at line 466, Update the test case name string "not yet propagated, nothing to do" to accurately reflect that the controller requeues; follow the "When ... it should ..." pattern (e.g., "When CR not yet propagated it should requeue for later processing") so readers understand expected requeue behavior — locate the table-driven test entry that sets name: "not yet propagated, nothing to do" in certificaterevocationcontroller_test.go and replace it with a descriptive "When ... it should ..." phrasing.
851-851: ⚡ Quick winAlign test name with coding guidelines.
Consider renaming this test to follow the "When ... it should ..." pattern for consistency with the coding guidelines and to more clearly indicate the expected behavior.
📝 Suggested test name
- name: "validating, previous still valid", + name: "When old signer still in total bundle it should requeue",As per coding guidelines, use "When ... it should ..." format for describing test cases when creating unit tests.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go` at line 851, Rename the test case name string in the table of test cases inside certificaterevocationcontroller_test.go (the entry currently named "validating, previous still valid") to follow the "When ... it should ..." pattern; for example change it to something like "When validating and previous certificate is still valid it should not revoke" so the test case name used by the test runner (the table entry `name` field) follows the coding guideline and clearly describes expected behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`:
- Line 466: Update the test case name string "not yet propagated, nothing to do"
to accurately reflect that the controller requeues; follow the "When ... it
should ..." pattern (e.g., "When CR not yet propagated it should requeue for
later processing") so readers understand expected requeue behavior — locate the
table-driven test entry that sets name: "not yet propagated, nothing to do" in
certificaterevocationcontroller_test.go and replace it with a descriptive "When
... it should ..." phrasing.
- Line 851: Rename the test case name string in the table of test cases inside
certificaterevocationcontroller_test.go (the entry currently named "validating,
previous still valid") to follow the "When ... it should ..." pattern; for
example change it to something like "When validating and previous certificate is
still valid it should not revoke" so the test case name used by the test runner
(the table entry `name` field) follows the coding guideline and clearly
describes expected behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: b312a748-4eb5-4172-8613-b06d9e0f64eb
📒 Files selected for processing (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go
66b66af to
6a58287
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`:
- Line 466: Rename the test case title string in the table-driven tests in
certificaterevocationcontroller_test.go: change the 'name' field value "not yet
propagated, nothing to do" to the Gherkin-style "When not yet propagated it
should do nothing"; apply the same "When ... it should ..." renaming pattern to
the other affected test case mentioned (around the other occurrence referenced)
so all '*_test.go' cases use the required Gherkin format.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 5ab2a0a5-2e7f-4df7-a412-adaea7fccadc
📒 Files selected for processing (2)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.gocontrol-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go
| now: postRevocationClock.Now, | ||
| crrNamespace: "crr-ns", | ||
| crrName: "crr-name", | ||
| name: "not yet propagated, nothing to do", |
There was a problem hiding this comment.
Rename edited test cases to the required Gherkin format.
The updated test titles should follow When ... it should ... to match repo test naming requirements.
Suggested rename
- name: "not yet propagated, nothing to do",
+ name: "When new signer cert is not yet propagated it should requeue",
...
- name: "validating, previous still valid",
+ name: "When previous signer cert is still trusted it should requeue",As per coding guidelines, "**/*_test.{go,ts,tsx,js,jsx}: Always use "When ... it should ..." format for describing test cases when creating unit tests" and "**/*_test.go: Use Gherkin Syntax to define unit test cases with 'When... it should...' pattern".
Also applies to: 851-851
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`
at line 466, Rename the test case title string in the table-driven tests in
certificaterevocationcontroller_test.go: change the 'name' field value "not yet
propagated, nothing to do" to the Gherkin-style "When not yet propagated it
should do nothing"; apply the same "When ... it should ..." renaming pattern to
the other affected test case mentioned (around the other occurrence referenced)
so all '*_test.go' cases use the required Gherkin format.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go (1)
466-470:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRename updated table-case titles to the required Gherkin format.
At Line 466 and Line 851, the case names still don’t follow
When ... it should .... Please rename both while touching these scenarios.As per coding guidelines, "
**/*_test.{go,ts,tsx,js,jsx}: Always use "When ... it should ..." format for describing test cases when creating unit tests" and "**/*_test.go: Use Gherkin Syntax to define unit test cases with 'When... it should...' pattern".Also applies to: 851-855
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go` around lines 466 - 470, Rename the table-driven test case titles (the "name" fields) in certificaterevocationcontroller_test.go from plain phrases to Gherkin-style "When ... it should ..." sentences; specifically update the case with name "not yet propagated, nothing to do" (the entry where now is postRevocationClock.Now and expectedRequeue is true) to something like "When new cert not yet in total bundle it should requeue", and likewise change the other case around the second occurrence (the one mentioned at lines 851-855) to follow the same "When ... it should ..." pattern so all test case names conform to the Gherkin guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`:
- Around line 466-470: Rename the table-driven test case titles (the "name"
fields) in certificaterevocationcontroller_test.go from plain phrases to
Gherkin-style "When ... it should ..." sentences; specifically update the case
with name "not yet propagated, nothing to do" (the entry where now is
postRevocationClock.Now and expectedRequeue is true) to something like "When new
cert not yet in total bundle it should requeue", and likewise change the other
case around the second occurrence (the one mentioned at lines 851-855) to follow
the same "When ... it should ..." pattern so all test case names conform to the
Gherkin guideline.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 132a6468-5157-46cd-a08c-352458ba0e01
📒 Files selected for processing (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go
510bbda to
71a71c2
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (2)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go (2)
466-466:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winTest case name still doesn't follow required Gherkin format.
This test name was already flagged in a previous review but hasn't been updated. As per coding guidelines, test cases must use "When ... it should ..." format.
📝 Suggested rename
- name: "not yet propagated, nothing to do", + name: "When new signer cert not yet in total bundle it should requeue",As per coding guidelines, "**/*_test.{go,ts,tsx,js,jsx}: Always use 'When ... it should ...' format for describing test cases when creating unit tests".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go` at line 466, Rename the failing test case name string "not yet propagated, nothing to do" to follow the required Gherkin format "When ... it should ..."; locate the test table entry in certificaterevocationcontroller_test.go where name: "not yet propagated, nothing to do" appears and replace it with a descriptive Gherkin-style name such as "When revocation is not yet propagated it should do nothing" (or similar) so the test description follows the "When ... it should ..." guideline.
851-851:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winTest case name still doesn't follow required Gherkin format.
This test name was already flagged in a previous review but hasn't been updated. As per coding guidelines, test cases must use "When ... it should ..." format.
📝 Suggested rename
- name: "validating, previous still valid", + name: "When old signer cert still in total bundle it should requeue",As per coding guidelines, "**/*_test.{go,ts,tsx,js,jsx}: Always use 'When ... it should ...' format for describing test cases when creating unit tests".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go` at line 851, Rename the test case name value "validating, previous still valid" to follow the required Gherkin-style "When ... it should ..." format (e.g., "When validating and previous certificate is still valid it should ...") by updating the test case's name field in the table-driven test (the entry with name: "validating, previous still valid") so it starts with "When" and includes "it should" describing the expected outcome.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`:
- Line 466: Rename the failing test case name string "not yet propagated,
nothing to do" to follow the required Gherkin format "When ... it should ...";
locate the test table entry in certificaterevocationcontroller_test.go where
name: "not yet propagated, nothing to do" appears and replace it with a
descriptive Gherkin-style name such as "When revocation is not yet propagated it
should do nothing" (or similar) so the test description follows the "When ... it
should ..." guideline.
- Line 851: Rename the test case name value "validating, previous still valid"
to follow the required Gherkin-style "When ... it should ..." format (e.g.,
"When validating and previous certificate is still valid it should ...") by
updating the test case's name field in the table-driven test (the entry with
name: "validating, previous still valid") so it starts with "When" and includes
"it should" describing the expected outcome.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 0dbdd987-da82-446c-bc9c-5ab1fc34113f
📒 Files selected for processing (2)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.gocontrol-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go
71a71c2 to
a3eaabb
Compare
|
/test e2e-aks |
|
I may be wrong but I the title is somehow misleading The controller already has three mechanisms to re-reconcile when the trust bundle changes:
So I'm not positive we've a RACE here but definitively it's speeding up the revocation mechanism so this PR is safe and it's adding a real value. Thanks! |
@sdminonne , I’ll update the PR description to remove the "race condition" phrasing and focus accurately on the unhandled resync delay causing the timeout. Thanks for your inputs! |
|
/cc @sdminonne |
|
/approve |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
Update logic of CertificateRevocationController to request a requeue when encountering stale total-client-ca bundle cache.
63d7067 to
16b595b
Compare
|
/test e2e-conformance |
|
/retitle OCPBUGS-86310: Handle CA bundle aggregation delay by requeuing revocation |
|
Scheduling tests matching the |
|
/test e2e-aws |
Test Resultse2e-aws
e2e-aks
|
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/test e2e-azure-self-managed |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/test e2e-azure-self-managed |
Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryAll 334 e2e tests passed (0 failures, 29 skipped). The pre phase (cluster installation) and test phase (e2e execution) both succeeded. The job was marked as failed solely because the post-phase Root CauseThe failure is a transient CI infrastructure issue, not a product or test bug. Here is the precise chain of events:
This failure has zero relation to PR #8563 (which adds CA bundle aggregation delay handling). The PR's code changes were never exercised during cluster teardown, and all functional tests covering the PR's behavior passed. Recommendations
Evidence
|
|
/test e2e-azure-self-managed |
|
/verified by CI |
|
@YamunadeviShanmugam: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@YamunadeviShanmugam: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@YamunadeviShanmugam: Jira Issue Verification Checks: Jira Issue OCPBUGS-86310 Jira Issue OCPBUGS-86310 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Fix included in release 5.0.0-0.nightly-2026-05-22-041009 |
|
/jira backport release-4.22 |
|
@bryan-cox: The following backport issues have been created:
Queuing cherrypicks to the requested branches to be created after this PR merges: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@openshift-ci-robot: new pull request created: #8746 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
** CA bundle aggregation delay**
During End-to-End (e2e) testing, CertificateRevocationRequest (CRR) successfully processes initial steps such as leaf and root certificate regeneration, but permanently stuck on Condition: PreviousCertificatesRevoked=False Reason: WaitingForAvailable(Previous signer certificate not yet revoked.)
The failure is caused by race condition between CertificateRevocationController and TargetConfigController
Fix:
Modified certificaterevocationcontroller.go to explicitly request a requeue when the old signer certificate has not yet removed from the total trust bundle by TargetConfigController.
Summary by CodeRabbit